Data Processing
&
Visualization

Intro

This document is for a workshop providing tools, tips, packages, etc. that make data processing and visualization in R easier. It is oriented toward those who have had some exposure to R in an applied fashion, but would also be useful to someone coming to R from another programming language. It is not an introduction to R, nor does this document have anything to do with statistical analysis.

Note that this document used to be slide-based, so it still has a lot of short sentences. I got tired of the limitations of slides for code and other presentation, so it is now in this format.

Outline

Part 1

  • Overview of Data Structures
  • Input/Output
  • Vectorization and Apply functions

Part 2

  • Pipes, and how to use them
  • plyr, dplyr, tidyr
  • data.table

Part 3

  • Visualization with ggplot2
  • Adding Interactivity


Data Structures

R has several core data structures:

  • Vectors
  • Factors
  • Lists
  • Matrices/arrays
  • Data frames

Vectors

Vectors form the basis of R data structures.

There are two main types, atomic vectors and lists, but lists will be treated separately.

Here is an R vector. The elements of the vector are numeric values.

x = c(1, 3, 2, 5, 4)
x
[1] 1 3 2 5 4

All elements of an atomic vector are the same type. Examples include:

  • character
  • numeric (double)
  • integer
  • logical
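A quick way to inspect this is typeof, which reports a vector's underlying storage type. A minimal sketch:

```r
# typeof reports the underlying storage type of an atomic vector
typeof('a')    # character
typeof(2.5)    # double
typeof(2L)     # integer (the L suffix makes a literal integer)
typeof(TRUE)   # logical
```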

Factors

An important type of vector is the factor. Factors are used to represent categorical data.

x = factor(1:3, labels=c('q', 'V', 'what the heck?'))
x
[1] q              V              what the heck?
Levels: q V what the heck?

The underlying representation is numeric, but it is important to remember that factors are categorical.

They can’t be used as numbers would be, as the following demonstrates.

as.numeric(x)
[1] 1 2 3
sum(x)
Error in Summary.factor(...) : 'sum' not meaningful for factors
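This bites hardest when the factor's labels themselves look numeric: as.numeric returns the underlying level codes, not the values the labels represent. A small illustration:

```r
f = factor(c('10', '20', '30'))
as.numeric(f)                 # 1 2 3 -- the underlying level codes
as.numeric(as.character(f))   # 10 20 30 -- the values the labels represent
```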

Matrices

With multiple dimensions, we are dealing with arrays.

Matrices are 2-d arrays, and are extremely common in scientific computing.

The vectors making up a matrix must all be of the same type.

  • e.g. all values in a matrix must be numeric.

Creating a matrix

Creating a matrix can be done in a variety of ways.

# create vectors
x = 1:4
y = 5:8
z = 9:12

rbind(x, y, z)   # row bind
cbind(x, y, z)   # column bind
matrix(c(x, y, z), nrow=3, ncol=4, byrow=TRUE)

Lists

Lists in R are highly flexible objects, and probably the most commonly used in data science.

They can contain anything as their elements, even other lists.

  • unlike atomic vectors, whose elements must all be of the same type.

Here is a list. We use the list function to create it.

x = list(1, "apple", list(3, "cat"))
x
[[1]]
[1] 1

[[2]]
[1] "apple"

[[3]]
[[3]][[1]]
[1] 3

[[3]][[2]]
[1] "cat"

We often want to loop some function over a list.

for(elem in x) print(class(elem))
[1] "numeric"
[1] "character"
[1] "list"

Lists can, and often do, have named elements.

x = list("a" = 25, "b" = -1, "c" = 0)
x["b"]
$b
[1] -1
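Note the difference between the single and double bracket here: [ returns a (sub)list, while [[ returns the element itself.

```r
x = list("a" = 25, "b" = -1, "c" = 0)
x['b']      # a list of length 1, still of class 'list'
x[['b']]    # the element itself: -1
x$b         # same as x[['b']]
```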

Data Frames

data.frames are a very commonly used data structure. Their columns do not all have to be of the same type, and this is because the data.frame class is actually just a list (of columns).

  • As such, everything about lists applies to them.

But they can also be indexed by row or column as well.

  • like matrices

Many packages provide classes that are both a data.frame and some other type of structure (e.g. tibbles and data.tables, discussed later).

Creating a data frame

mydf = data.frame(a = c(1,5,2), 
                  b = c(3,8,1))

We can add row names also.

rownames(mydf) = paste0('row', 1:3)
mydf
     a b
row1 1 3
row2 5 8
row3 2 1
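Because a data.frame is a list of columns that can also be indexed like a matrix, both styles work on mydf:

```r
mydf = data.frame(a = c(1, 5, 2), b = c(3, 8, 1))
rownames(mydf) = paste0('row', 1:3)

mydf$a               # list-style: extract a column by name
mydf[['b']]          # likewise
mydf['row2', 'a']    # matrix-style: by row and column
```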

Exercise

Exercise #1

Create an object that is a matrix and/or a data.frame, and inspect its class or structure.

mydf = data.frame(A=1:3, B=letters[1:3])

Exercise #2

Create a list of 3 elements, the first of which contains character strings, the second numbers, and the third, the data.frame or matrix you just created.

mylist = list(c('a', 'b'), 1:3, mydf)

Thinking Exercises

  • How is a factor different from a character vector?

  • How is a data.frame the same as and different from a matrix?

  • How is a data.frame the same as and different from a list?

Input/Output

Standard methods of reading in data include the following functions:

  • read.table
  • read.csv
  • readLines
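A minimal, self-contained sketch of read.csv (using the text argument in place of a file name, so no file is needed here):

```r
# read.csv normally takes a file path; text= lets us demo without a file
csv_text = "id,score\n1,10\n2,20\n3,30"
mydata = read.csv(text = csv_text)
mydata   # a data.frame with columns id and score
```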

Base R also comes with the foreign package for reading in other types of files, especially other statistical packages. However, while you may see it still in use, it’s not as useful as what’s found in other packages.

Newer approaches

haven: Package to read in foreign statistical files

  • read_spss
  • read_dta

readxl: for excel files

rio: uses haven, readxl etc. but with just two functions for everything

  • import, export (also convert)

Faster approaches

readr: Faster versions of base R functions

  • read_csv
  • read_delim

These make assumptions after an initial scan of the data, but if you don’t have ‘big’ data, this won’t help much.

However, they actually can be used as a diagnostic.

  • pick up potential data entry errors.

data.table: faster read.table

  • fread

Typically faster than readr approaches.

Other Data

Note that R can handle many types of data.

Some examples:

  • JSON
  • SQL
  • XML
  • YAML
  • MongoDB
  • NETCDF
  • text (e.g. a novel)
  • shapefiles
  • google spreadsheets

And many, many others.

On the horizon

feather: designed to make reading/writing data frames efficient

Works in both Python and R. Still in early stages of development.

Indexing

Base R Indexing Refresher

Slicing vectors:

letters[4:6]
[1] "d" "e" "f"
letters[c(13,10,3)]
[1] "m" "j" "c"

Slicing matrices/data.frames:

myMatrix[1, 2:3]

Label-based indexing:

mydf['row1', 'b']

Position-based indexing:

mydf[1, 2]

Mixed indexing:

mydf['row1', 2]

If the row/column value is empty, all rows/columns are retained.

mydf['row1',]
mydf[,'b']

Non-contiguous:

mydf[c(1,3),]

Boolean:

mydf[mydf$a >= 2,]
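One more base idiom worth adding to the refresher (not shown above): negative indices drop rows or columns rather than keeping them.

```r
mydf = data.frame(a = c(1, 5, 2), b = c(3, 8, 1))
mydf[-1, ]    # all rows except the first
mydf[, -2]    # all columns except the second
```

Note that with only one column left, `mydf[, -2]` drops to a plain vector by default (use `drop=FALSE` to keep a data.frame).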

List/Data.frame extraction

[ : grab a slice of elements/columns

[[ : grab specific elements/columns

$ : grab specific elements/columns

@ : extract slot for S4 objects

my_list_or_df[2:4]
my_list_or_df[['name']]
my_list_or_df$name
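For completeness, @ applies only to S4 objects. A toy S4 class (hypothetical names, just for illustration) shows the distinction:

```r
library(methods)

# a minimal S4 class; slots are accessed with @, not $
setClass('point', slots = c(x = 'numeric', y = 'numeric'))
p = new('point', x = 1, y = 2)
p@x   # 1
p@y   # 2
```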

Vectorization & Apply

Boolean Indexing

Logicals are objects with values of TRUE or FALSE.

Assume x is a vector of numbers.

x = c(-1, 2, 10, -5)
idx = x > 2
idx
[1] FALSE FALSE  TRUE FALSE
x[idx]
[1] 10

Flexibility

We don’t have to create an explicit index object first, as R indexing is ridiculously flexible.

x[x > 2]
x[x != 3]
x[ifelse(x > 2  & x !=10, T, F)]
x[{y = idx; y}]
x[resid(lm(y ~ x)) > 0]

Vectorized operations

Consider the following unfortunately coded loop:

for (i in 1:nrow(mydf)) {
  check = mydf$x[i] > 2
  if (check==TRUE) {
    mydf$y[i] = 'Yes'
  } 
  else {
    mydf$y[i] = 'No'
  }
}

Compare:

mydf$y = 'No'
mydf$y[mydf$x > 2] = 'Yes'

# alternative approach that doesn't use indexing
# mydf$y = ifelse(mydf$x > 2, 'Yes', 'No')

This gets us the same thing, and would be much faster.

Boolean indexing is an example of a vectorized operation.

The whole vector is considered.

  • Rather than each element individually
  • Any preprocessing is done once rather than n times

This is almost always faster.

Example: Log all values in a matrix.

mymatrix_log = log(mymatrix)

Way faster than looping over elements, rows or columns. Here we’ll let the apply function stand in for our loop, logging the elements of each column.

mymatrix = matrix(runif(100), 10, 10)
identical(apply(mymatrix, 2, log), log(mymatrix))
[1] TRUE
library(microbenchmark)
microbenchmark(apply(mymatrix, 2, log),
               log(mymatrix))
Unit: microseconds
                    expr    min     lq     mean median     uq     max neval
 apply(mymatrix, 2, log) 43.227 45.328 49.90904 45.928 48.780 133.883   100
           log(mymatrix)  2.701  3.002  3.41332  3.302  3.302  16.210   100

Many vectorized functions already exist in R.

They are often written in C, Fortran etc., and so even faster.

Apply functions

A family of functions allows for a succinct way of looping.

Common ones include:

  • apply
    • arrays, matrices, data.frames
  • lapply, sapply, vapply
    • lists, data.frames, vectors
  • tapply
    • grouped operations (table apply)
  • mapply
    • multivariate version of sapply
  • replicate
    • similar to sapply
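As a quick sketch of two of these:

```r
# sapply: apply a function over a list, simplifying the result to a vector
sapply(list(a = 1:3, b = 4:6), sum)   # named vector: a = 6, b = 15

# tapply: apply a function within groups
vals   = c(1, 2, 10, 20)
groups = c('g1', 'g1', 'g2', 'g2')
tapply(vals, groups, mean)            # g1 = 1.5, g2 = 15
```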

Example

Standardizing variables.

for (i in 1:ncol(mydf)){
  x = mydf[,i]
  for (j in 1:length(x)){
    x[j] = (x[j] - mean(x))/sd(x)
  }
}

The above would be a really bad way to use R. The following is much cleaner and now you’d have a function you can use elsewhere.

apply will take a matrix, and apply a function over the margin, row or column, you want.

stdize <- function(x) {
  (x-mean(x))/sd(x)
}

apply(mydf, 2, stdize)   # 1 for rows, 2 for columnwise application

Timings

The previous demonstrates how to use apply.

However, there is a scale function in base R that uses a more vectorized approach under the hood.

The following demonstrates various approaches to standardizing the columns of the matrix, even using a parallelized approach.

The base R function requires very little code and beats the others.

mydf = matrix(rnorm(100000), ncol=1000)

stdize <- function(x) {
  (x-mean(x)) / sd(x)
}

doubleloop = function() {
  for (i in 1:ncol(mydf)) {
    x = mydf[, i]
    for (j in 1:length(x)) {
      x[j] = (x[j] - mean(x)) / sd(x)
    }
  }
}


singleloop = function() {
  for (i in 1:ncol(mydf)) {
    x = mydf[, i]
    x = (x - mean(x)) / sd(x)
  }
}


library(parallel)
cl = makeCluster(8)
clusterExport(cl, c('stdize', 'mydf'))
doParallel::registerDoParallel(cl)

test = microbenchmark::microbenchmark(doubleloop=doubleloop(),
                                      singleloop=singleloop(), 
                                      apply=apply(mydf, 2, stdize), 
                                      parApply=parApply(cl, mydf, 2, stdize),
                                      vectorized=scale(mydf), times=25)
stopCluster(cl)
test
Unit: milliseconds
       expr         min          lq       mean     median         uq        max neval
 doubleloop 2882.148089 2904.739076 2938.21281 2946.57065 2952.25532 3013.29805    25
 singleloop   28.989806   29.512434   33.08177   31.17037   32.42606   84.79185    25
      apply   32.167901   33.350945   33.96146   34.13204   34.64055   36.18983    25
   parApply   17.691611   18.213638   21.91323   19.63893   21.12156   73.60683    25
 vectorized    8.085578    8.491433   10.25583   10.66990   11.02532   12.62173    25

Apply functions

Benefits

  • Cleaner/simpler code
  • Environment kept clear of unnecessary objects
  • Potentially more reproducible
    • more likely to use generalizable functions
  • Might be faster
  • Parallelizable

NOT necessarily faster than explicit loops.

  • the single loop over columns was about as fast as apply
  • replicate and mapply are especially slow

But they can almost always be made faster than loops.

  • Parallelization: parApply, parLapply etc.

Personal experience

I use R every day, and rarely use explicit loops.

  • Note: no speed difference for a for loop vs. using while
  • If you must use an explicit loop, create an empty object of the dimension/form you need, and then fill it in.
    • Notably faster
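For example, preallocate and fill, rather than growing the object inside the loop:

```r
n = 1e4
out = numeric(n)            # preallocate once, at the final size
for (i in seq_len(n)) {
  out[i] = i^2              # fill in place; no copy on each iteration
}
```

Growing with something like `out = c(out, i^2)` forces a copy every iteration and is dramatically slower.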

I pretty much never use an explicit double loop, as a little more thinking about the problem will usually provide a more efficient path to solving the problem.

Apply functions

Apply functions and similar approaches should be a part of your regular R experience.

Other versions we’ll talk about have been optimized, but you need to know the basics in order to use those.

And you may still need parallel versions.

Exercise

With the following matrix, use apply and the sum function to get row or column sums.

x = matrix(1:9, 3, 3)
apply()

With the following list, use lapply and sapply and the sum function to get sums for the elements. There is no margin to specify for a list, so just supply the list and the sum function.

x = list(1:3, 4:6, 7:9)
lapply()
sapply()

sapply is actually just a wrapper for lapply. If you supply the argument simplify=FALSE, the result is identical to lapply's. Otherwise, it attempts to simplify the result to a vector or matrix.
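A quick check of that claim:

```r
x = list(1:3, 4:6, 7:9)
identical(sapply(x, sum, simplify = FALSE), lapply(x, sum))  # TRUE
sapply(x, sum)   # simplified: 6 15 24
```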

Pipes

Note: More detail on much of this part is given in another workshop.

Pipes are operators that send what comes before to what comes after.

There are many different pipes, and some packages that use their own.

However, the vast majority of packages use the same pipe:

%>%

Here, we’ll focus on their use with the dplyr package. Later, we’ll use it for visualizations.

Example:

mydf %>% 
  select(var1, var2) %>% 
  filter(var1 == 'Yes') %>% 
  summary

Start with a data.frame %>%
    select columns from it %>%
    filter/subset it %>%
    get a summary

Using variables as they are created

We can use variables as soon as they are created.

mydf %>% 
  mutate(newvar1 = var1 + var2,
         newvar2 = newvar1/var3) %>% 
  summarise(newvar2avg = mean(newvar2))

Pipes for Visualization (more later)

Generic example:

basegraph %>% 
  points %>%
  lines %>%
  layout

The dot

Most functions are not ‘pipe-aware’ by default.

Example: pipe to a modeling function.

mydf %>% 
  lm(y ~ x)  # error

Other pipes could work in this situation.

  • e.g. %$% in magrittr.

But generally, one can use a dot.

  • The dot refers to the object passed in from the left-hand side of the pipe.

mydf %>% 
  lm(y ~ x, data=.)

Flexibility

Piping is not just for data.frames.

  • The following starts with a character vector.
  • Sends it to a recursive function (named ..).
    • .. is created on-the-fly.
  • After the function is created, it’s used on ., representing the string.
  • Result: pipes between the words.
c('Ceci', "n'est", 'pas', 'une', 'pipe!') %>%
{
  .. <-  . %>%
    if (length(.) == 1)  .
    else paste(.[1], '%>%', ..(.[-1]))
  ..(.)
} 
[1] "Ceci %>% n'est %>% pas %>% une %>% pipe!"

Put that in your pipe and smoke it, René Magritte!

Summary

Pipes are best used interactively.

Extremely useful for data exploration.

Common in many visualization packages.

See the magrittr package for more pipes.

Tidyverse

What is the tidyverse?

The tidyverse consists of a few key packages:

  • ggplot2: data visualisation
  • dplyr: data manipulation
  • tidyr: data tidying
  • readr: data import
  • purrr: functional programming, e.g. alternate approaches to apply
  • tibble: tibbles, a modern re-imagining of data frames

And of course the tidyverse package which will load all of the above.

See also: lubridate, rvest, stringr and others in the ‘hadleyverse’.

What is tidy?

Tidy refers to data arranged in a way that makes analysis and visualization simpler.

In a tidy dataset:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

Think long before wide.

dplyr

Grammar of data manipulation.

Next iteration of plyr.

Focused on tools for working with data frames.

  • Over 100 functions

It has three main goals:

  • Make the most important data manipulation tasks easier.
  • Do them faster.
  • Use the same interface to work with data frames, data tables, or databases.

Some key operations:

  • select: grab columns
    • select helpers: one_of, starts_with, num_range etc.
  • filter/slice: grab rows
  • group_by: grouped operations
  • mutate/transmute: create new variables
  • summarize: summarize/aggregate
  • do: arbitrary operations

Various join/merge functions:

  • inner_join, left_join etc.

Little things like:

  • n, n_distinct, nth, n_groups, count, recode, between

No need to quote variable names.

An example

Let’s say we want to select from our data the following variables:

  • Start with the ID variable
  • The variables X1:X10, which are not all together, and there are many more X columns
  • The variables var1 and var2, which are the only var variables in the data
  • Any variable that starts with XYZ

How might we go about this?

Some base R approaches

Tedious, or multiple objects just to get the columns you want.

# numeric indexes; not conducive to readability or reproducibility
newData = oldData[,c(1,2,3,4, etc.)]

# explicitly by name; fine if only a handful; not pretty
newData = oldData[,c('ID','X1', 'X2', etc.)]

# two step with grep; regex difficult to read/understand
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep('^XYZ', colnames(oldData), value=TRUE))
newData = oldData[,cols]

# or via subset
newData = subset(oldData, select = cols)

More

What if you also want observations where Z is Yes, Q is No, and only the observations with the top 50 values of var2, ordered by var1 (descending)?

# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = newData[order(newData$var2, decreasing=T)[1:50],]
newData = newData[order(newData$var1, decreasing=T),]

And this is for fairly straightforward operations.

The dplyr way

newData = oldData %>% 
  filter(Z == 'Yes', Q == 'No') %>% 
  select(ID, num_range('X', 1:10), contains('var'), starts_with('XYZ')) %>% 
  top_n(n=50, var2) %>% 
  arrange(desc(var1))

An alternative

dplyr and piping are an alternative

  • you can do all this sort of stuff with base R
  • with, within, subset, transform, etc.

Even though the base R approach shown can be fairly concise, it is potentially:

  • noisier
  • less legible
  • less amenable to additional data changes
  • dependent on esoteric knowledge (e.g. regular expressions)
  • reliant on new objects (even if we just want to explore)

tidyr

Two primary functions for manipulating data

  • gather: wide to long
  • spread: long to wide

Other useful functions include:

  • unite: paste together multiple columns into one
  • separate: complement of unite

Example

library(tidyr)
stocks <- data.frame( time = as.Date('2009-01-01') + 0:9,
                      X = rnorm(10, 0, 1),
                      Y = rnorm(10, 0, 2),
                      Z = rnorm(10, 0, 4) )
stocks %>% head
        time           X           Y          Z
1 2009-01-01 -1.13209158  0.90099099  1.3601499
2 2009-01-02  0.24670969  0.02995202 -0.4369245
3 2009-01-03 -0.18533572  0.54440515 -1.1737488
4 2009-01-04  0.17047486 -1.00201046 -3.9265851
5 2009-01-05 -0.02331142 -3.10298741 -7.2464021
6 2009-01-06 -1.29181733 -2.22339543 -0.3179180
stocks %>% gather(stock, price, -time) %>% head
        time stock       price
1 2009-01-01     X -1.13209158
2 2009-01-02     X  0.24670969
3 2009-01-03     X -0.18533572
4 2009-01-04     X  0.17047486
5 2009-01-05     X -0.02331142
6 2009-01-06     X -1.29181733

Note that the latter is an example of tidy data while the former is not.

Personal Opinion

The dplyr grammar is clear for a lot of standard data processing tasks, and some not so common.

Extremely useful for data exploration and visualization.

  • No need to create/overwrite existing objects
  • Can overwrite columns as they are created
  • Makes it easy to look at anything, and do otherwise tedious data checks

Drawbacks:

  • not as fast as data.table for many things
  • the mindset can make for unnecessary complication
    • e.g. no need to pipe etc. to create one new variable

On the horizon

multidplyr

  • Partitions the data across a cluster.
  • Faster than data.table (after partitioning)

data.table

data.table works in a notably different way than dplyr. However, you’d use it for the same reasons.

Like dplyr, the data objects are both data.frames and a package specific class.

Faster subset, grouping, update, ordered joins etc.

In general, data.table works with brackets as in base R.

However, the brackets work like a function call!

  • Several key arguments:

x[i, j, by, keyby, with = TRUE, ...]

Importantly: you don’t use the brackets as you would with data.frames.

library(data.table)
dt = data.table(x=sample(1:10, 6), g=1:3, y=runif(6))
class(dt)
[1] "data.table" "data.frame"

What i and j can be are fairly complex.

In general, you use i for filtering by rows.

dt[2]
dt[2,]
   x g         y
1: 1 2 0.6158981
   x g         y
1: 1 2 0.6158981

In general, you use j to select (by name!) or create new columns.

  • Define a new variable with :=

dt[,x]
dt[,z := x+y]  # dt now has a new column
dt[,z]
dt[g>1, mean(z), by=g]
dt
[1]  3  1 10  5  7  6
    x g          y         z
1:  3 1 0.01040662  3.010407
2:  1 2 0.61589805  1.615898
3: 10 3 0.50496462 10.504965
4:  5 1 0.57399499  5.573995
5:  7 2 0.32487892  7.324879
6:  6 3 0.28128785  6.281288
[1]  3.010407  1.615898 10.504965  5.573995  7.324879  6.281288
   g       V1
1: 2 4.470388
2: 3 8.393126
    x g          y         z
1:  3 1 0.01040662  3.010407
2:  1 2 0.61589805  1.615898
3: 10 3 0.50496462 10.504965
4:  5 1 0.57399499  5.573995
5:  7 2 0.32487892  7.324879
6:  6 3 0.28128785  6.281288

Dropping columns is awkward.

  • because j is an argument
dt[,-y]             # creates negative values of y
dt[,-'y', with=F]   # drops y, but now needs quotes
## dt[,y:=NULL]        # drops y, but this is just a base R approach
## dt$y = NULL
[1] -0.01040662 -0.61589805 -0.50496462 -0.57399499 -0.32487892 -0.28128785
    x g         z
1:  3 1  3.010407
2:  1 2  1.615898
3: 10 3 10.504965
4:  5 1  5.573995
5:  7 2  7.324879
6:  6 3  6.281288

Grouped operations

group-by, with creation of a new variable.

Note that these actually modify dt in place, a key distinction with dplyr.

dt1 = dt2 = dt   # note: no copies are made; all three names refer to the same table (use copy() for a true copy)
dt[,sum(x,y), by=g]                  # sum of all x and y values
   g        V1
1: 1  8.584402
2: 2  8.940777
3: 3 16.786252
dt1[,newvar := sum(x), by=g]         # add new variable to the original data 
    x g          y         z newvar
1:  3 1 0.01040662  3.010407      8
2:  1 2 0.61589805  1.615898      8
3: 10 3 0.50496462 10.504965     16
4:  5 1 0.57399499  5.573995      8
5:  7 2 0.32487892  7.324879      8
6:  6 3 0.28128785  6.281288     16
dt1
    x g          y         z newvar
1:  3 1 0.01040662  3.010407      8
2:  1 2 0.61589805  1.615898      8
3: 10 3 0.50496462 10.504965     16
4:  5 1 0.57399499  5.573995      8
5:  7 2 0.32487892  7.324879      8
6:  6 3 0.28128785  6.281288     16

We can also create groupings on the fly.

For a new summary data set, we’ll take the following approach.

dt2[, list(meanx = mean(x), sumx = sum(x)), by=g==1]
       g meanx sumx
1:  TRUE     4    8
2: FALSE     6   24

Faster!

  • joins, which are easy to do (note that i can be a data.table)

dt1[dt2]
  • group operations: via setkey
  • reading files: fread
  • character matches: e.g. via chmatch

Timings

The following demonstrates some timings from here.

  • Reproduced on my own machine
  • based on 50 million observations
  • Grouped operations are just a sum and length on a vector.

By the way, never, ever use aggregate. For anything.

          fun elapsed
1:  aggregate  114.35
2:         by   24.51
3:     sapply   11.62
4:     tapply   11.33
5:      dplyr   10.97
6:     lapply   10.65
7: data.table    2.71

Ever.

Really.

Pipe with data.table

Can be done but awkward at best.

mydt[,newvar:=mean(x),][,newvar2:=sum(newvar), by=group][,-'y', with=FALSE]
mydt[,newvar:=mean(x), 
  ][,newvar2:=sum(newvar), by=group
  ][,-'y', with=FALSE
  ]

Probably better to just use a pipe and dot approach.

mydt[,newvar:=mean(x),] %>% 
  .[,newvar2:=sum(newvar), by=group] %>% 
  .[,-'y', with=FALSE]

My take

Faster methods are great to have.

  • Especially for group-by and joins.

Drawbacks:

  • Complex
  • The syntax can be awkward
  • It doesn’t work like a data.frame
  • Piping with brackets
  • Does not have its own ‘verse’

Compromise

If speed and/or memory is (potentially) a concern, data.table

For interactive exploration, dplyr

Piping allows one to use both, so no need to choose.

And on the horizon…

dtplyr

Coming soon to an R near you.

This implements the data table back-end for ‘dplyr’ so that you can seamlessly use data table and ‘dplyr’ together.

Or play with it now. The following shows times for a grouped operation on a data frame of two variables: a random uniform draw of 5e7 values, and a grouping variable with 500k groups.

    package  timing
      dplyr   10.97
 data.table    2.71
     dtplyr    2.70

Just for giggles I did the same in python with a pandas DataFrame and groupby operation, and it took 7+ seconds.

ggplot2

ggplot2 is an extremely popular package for visualization in R.

  • and copied in other languages/programs

It entails a grammar of graphics.

  • Every graph is built from the same few parts

Key ideas:

  • Aesthetics
  • Layers (and geoms)
  • Piping
  • Facets
  • Themes
  • Extensions

Strengths:

  • Ease of getting a good looking plot
  • Easy customization
  • A lot of data processing is done for you
  • Clear syntax
  • Easy multidimensional approach
  • Equally spaced colors as a default

Aesthetics

Aesthetics map data to aesthetic aspects of the plot.

  • Size
  • Color
  • etc.

The function used in ggplot to do this is aes

aes(x=myvar, y=myvar2, color=myvar3, group=g)

Layers

In general, we start with a base layer and add to it.

In most cases you’ll start as follows.

ggplot(aes(x=myvar, y=myvar2), data=mydata)

This would just produce a plot background.

Piping

Layers are added via piping.

The first layers added are typically geoms:

  • points
  • lines
  • density
  • text

ggplot2 was using pipes before it was cool, and so it has a different pipe.

Otherwise, the concept is the same as before.

ggplot(aes(x=myvar, y=myvar2), data=mydata) +
  geom_point()

And now we would have a scatterplot.

Examples

library(ggplot2)
data("diamonds"); data('economics')
ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point()

ggplot(aes(x=date, y=unemploy), data=economics) +
  geom_line()

In the following, one setting is not mapped to the data.

ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point(aes(size=carat, color=clarity), alpha=.25) 

Stats

There are many statistical functions built in.

Key strength: you don’t have to do much preprocessing.

Quantile regression lines:

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  geom_quantile()

Loess (or additive model) smooth:

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  geom_smooth()

Bootstrapped confidence intervals:

ggplot(mtcars, aes(cyl, mpg)) + 
  geom_point() +
  stat_summary(fun.data = "mean_cl_boot", colour = "orange", alpha=.75, size = 1)

Facets

Facets allow for paneled display, a very common operation.

In general, we often want comparison plots.

facet_grid will produce a grid.

  • Often this is all that’s needed

facet_wrap is more flexible.

Both use a formula approach to specify the grouping.

facet_grid
ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  facet_grid(vs ~ cyl, labeller = label_both)

facet_wrap
ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  facet_wrap(vs ~ cyl, labeller = label_both, ncol=2)

Fine control

ggplot2 makes it easy to get good looking graphs quickly.

However the amount of fine control is extensive.

ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point(aes(color=clarity), alpha=.5) + 
  scale_y_log10(breaks=c(1000,5000,10000)) +
  xlim(0, 10) +
  scale_color_brewer(type='div') +
  facet_wrap(~cut, ncol=3) +
  theme_minimal() +
  theme(axis.ticks.x=element_line(color='darkred'),
        axis.text.x=element_text(angle=-45),
        axis.text.y=element_text(size=20),
        strip.text=element_text(color='forestgreen'),
        strip.background=element_blank(),
        panel.grid.minor=element_line(color='lightblue'),
        legend.key=element_rect(linetype=4),
        legend.position='bottom')

Themes

In the last example you saw two uses of a theme.

  • built-in
  • specific customization

Each argument takes on a specific value or an element function:

  • element_rect
  • element_line
  • element_text
  • element_blank

Themes

The base theme is not too good.

  • not for web
  • doesn’t look good for print either

You will almost invariably need to tweak it.

Extensions

ggplot2 now has its own extension system, and there is even a website to track the extensions.

Examples include:

  • additional themes
  • interactivity
  • animations
  • marginal plots
  • network graphs

Summary ggplot2

ggplot2 is an easy to use, but powerful visualization tool.

Allows one to think in many dimensions for any graph:

  • x
  • y
  • color
  • size
  • opacity
  • facet

2d graphs are only useful for conveying the simplest of ideas.

Use ggplot2 to easily create more interesting visualizations.

Interactive Visualization

Packages

ggplot2 is the most widely used package for visualization in R.

However, it is not interactive by default.

Many packages use htmlwidgets, d3 (JavaScript library) etc. to provide interactive graphics.

General:

  • plotly
    • also used in Python, Matlab, Julia; can convert ggplot2 images to interactive ones
  • highcharter
    • a very general wrapper for highcharts.js; works with some R packages out of the box
  • rbokeh
    • like plotly, it also has cross-program support

Specific functionality:

  • DT
    • interactive data tables
  • leaflet
    • maps with OpenStreetMap
  • visNetwork
    • Network visualization

Piping for Visualization

One of the advantages to piping is that it’s not limited to dplyr style data management functions.

Any R function can be potentially piped to.

  • several examples have already been shown.

This facilitates data exploration, especially visually.

  • don’t have to create objects
  • new variables are easily created and subsequently manipulated just for vis
  • data manipulation not separated from visualization

htmlwidgets

Many newer visualization packages take advantage of piping.

htmlwidgets is a package that makes it easy to create javascript visualizations.

  • i.e. what you see everywhere on the web.

The packages using it typically are pipe-oriented and produce interactive plots.

plotly example

A couple demonstrations with plotly.

Note the layering as with ggplot2.

Piping used before plotting.

library(plotly)
midwest %>% 
  filter(inmetro==T) %>% 
  plot_ly(x=~percbelowpoverty, y=~percollege, mode='markers') 

plotly has modes, which allow for points, lines, text and combinations.

Traces work similar to geoms.

library(mgcv); library(modelr)

mtcars %>% 
  mutate(amFactor = factor(am, labels=c('auto', 'manual')),
         hovertext = paste(wt, mpg, amFactor)) %>% 
  add_predictions(gam(mpg~s(wt, am, bs='fs'), data=mtcars)) %>% 
  arrange(wt) %>% 
  plot_ly(x=~wt, y=~mpg, color=~amFactor, mode='markers') %>% 
  add_trace(x=~wt, y=~pred, color=~amFactor, alpha=.5, hover=~hovertext, name='gam prediction')

ggplotly

The nice thing about plotly is that we can feed a ggplot to it.

It would have been easier to use geom_smooth, so let’s do so.

gp = mtcars %>% 
  mutate(amFactor = factor(am, labels=c('auto', 'manual')),
         hovertext = paste(wt, mpg, amFactor),
         prediction = predict(gam(mpg~s(wt), data=mtcars))) %>% 
  arrange(wt) %>% 
  ggplot(aes(x=wt, y=mpg)) +
  geom_smooth() +
  geom_point(aes(color=amFactor))
ggplotly(gp)

Highcharter

Highcharter is also fairly useful for a wide variety of plots.

  • Uses the highcharts.js library
library(highcharter); library(quantmod)

x <- getSymbols("GOOG", auto.assign = FALSE)

highchart(type='stock', width=1000) %>% 
   hc_add_series(data = x) 

visNetwork

visNetwork allows for network visualizations

  • Uses the vis.js library
library(visNetwork)
visNetwork(nodes, edges) %>%   # nodes and edges are data.frames defined beforehand
  visNodes(shape='circle', 
           font=list(), 
           scaling=list(min=10, max=50, label=list(enable=T))) %>% 
  visLegend()

DT

Use the DT package for interactive dataframes.

library(DT)
ggplot2movies::movies %>% 
  select(1:6) %>% 
  filter(rating>9) %>% 
  slice(sample(1:nrow(.), 50)) %>% 
  datatable(rownames=F)

Shiny

Shiny is a framework that can essentially allow you to build an interactive website.

  • Provided by RStudio developers

Most of the more recently developed visualization packages will work specifically within the shiny and rmarkdown settings.

Interactive and Visual Data Exploration

Interactivity allows for even more dimensions to be brought to a graphic.

Interactive graphics are more fun too!

  • But they must serve a purpose
  • Too often they are simply distraction, and detract from the data story

Just a couple visualization packages can go a very long way.

Summary

With the right tools, data exploration can be:

  • easier
  • faster
  • more efficient
  • more fun!

Use them to wring your data dry of what it has to offer.

Recommended next steps:



Embrace a richer understanding of your data!